Abstract


Many of today's high-performance computer processors are super-scalar. They can dispatch
multiple instructions per cycle and, hence, provide what is commonly referred to as
instruction-level parallelism. This super-scalar capability, combined with software pipelining, can
increase processor throughput dramatically. Achieving maximum throughput, however, is
nontrivial. Compilers must engage in aggressive optimization techniques, such as loop unrolling,
speculative code motion, etc., to structure code to take full advantage of the underlying computer
architecture. The phase-ordering implications of these optimizations are not well understood and
are the subject of continuing research. Of particular interest are optimizations that enhance
instruction-level parallelism. Two of these are loop unrolling and loop fusion. These are
source-level optimizations that can be performed by either the programmer or the compiler.
These optimizations have dramatic effects on the compiler's instruction scheduler. Performed too
aggressively, these optimizations can increase register pressure and result in costly memory
references. This paper details experiments performed to measure the effects of these source-level
code transformations and how they influenced register pressure and code performance.
1  Introduction and Problem Statement


Modern  high-performance  computer  platforms  are  capable  of  achieving  incredible  levels 
of  code  execution  speed.  One  way  they  increase  performance  is  by  taking  advantage  of 
parallelism  found  in  algorithms.  To  this  end,  many  of  these  systems  offer  multiprocessor 
parallelism. Furthermore, many also offer software pipelining to take full advantage of
low-level, or code-level, parallelism [1]. This is parallelism actually present in the way machine
instructions are dispatched.

Also of paramount importance is that these machines take full advantage of their
complicated memory systems. Most of the standard optimization techniques will really only
provide  maximum  performance  if  the  computer’s  memory  system  is  being  used  in  an  efficient 
manner.  Most  shared-memory  architectures  have  some  type  of  memory  hierarchy.  The  main 
reason  memories  are  implemented  in  this  fashion  is  to  optimize  the  price-performance  ratio, 
given  the  widening  gap  between  central  processing  unit  (CPU)  speed  and  main  memory 
performance.  CPU  speeds  are  currently  doubling  about  every  2  or  3  years  while  the  speed 
of  main  memory  has  historically  doubled  only  about  every  decade.  These  tiered  memories, 
with  their  nondeterministic  behavior,  are  hard  to  manage  and  predict.  This  makes  the  job 
of  the  compiler’s  code  generator  that  much  more  difficult.  Memory  systems  have  become 
so  complicated  on  some  architectures  that  slight  memory  reference  changes  on  codes  may 
speed  up  or  slow  down  execution  by  an  order  of  magnitude. 

Since  registers  provide  fast  data  access,  one  goal  of  the  compiler  back  end  is  to  allocate 
and  assign  registers  in  an  effective  manner.  The  register  allocator  tries  to  assign  a  register  to 
each  register  candidate.  Since  register  access  is  very  fast,  the  compiler  should  generate  code 
that  reuses  these  assigned  registers  as  much  as  possible.  To  do  this,  the  register  allocator 
and scheduler should work closely together [2]. This is actually a very complicated
phase-ordering problem. At the least, the scheduler should order code in a way that
instruction-level parallelism can be exploited, and the register allocator should give top priority to
assigning a register to frequently used variables. Several source-level optimizations can be
performed with the goal of increasing memory locality and instruction-level parallelism and
thus assisting the code generator.

The  problem,  however,  is  that  when  these  optimizations  are  pursued  too  aggressively, 
they  can  reach  a  point  of  diminishing  returns.  When  the  compiler  starts  to  run  out  of 
available  registers  to  use,  register  pressure  is  said  to  be  high.  At  the  point  where  registers 
are  no  longer  available,  the  register  allocator  must  actually  “spill”  a  register’s  content  to 
memory  to  free  it  for  other  uses  [3].  On  tiered  memory  machines,  such  an  action  can  be 
detrimental  to  varying  degrees.  If  the  value  is  written  to  cache,  the  access  time  is  very 
small, but the cache manager may still have to invalidate the cache line. This operation
can cause the cache line to be rewritten to main memory. Actual main memory access
can be very expensive. On the Silicon Graphics (SGI) Power Challenge architecture, this
delay can be as high as 90 cycles, though it seldom reaches that level. Accordingly, the
governing hypothesis for this study is that transformations that enhance memory locality
and code-level parallelism are beneficial only up to the point where register pressure becomes
very high.
2  Experimental Methodology

The  example  codes  listed  in  this  paper  were  all  run  on  the  SGI  Power  Challenge  architecture. 
This  is  a  64-bit  architecture  using  75-MHz  MIPS  R8000  processors.  There  are  32  64-bit 
floating-point  registers  available  to  the  assembler.  The  architecture  is  superscalar  and  can 
dispatch  up  to  four  instructions  per  cycle.  Prefetching  is  not  implemented  in  the  Power 
Challenge  architecture.  This  machine  uses  a  hierarchical  memory  structure  like  the  one 
described  previously. 

Loop unrolling and loop fusion were the two transformations studied in this
experiment. Both are common transformations, and loop unrolling in particular is the most
heavily used transformation for increasing instruction-level parallelism. All programs were
written in C and were compiled with the SGI MIPSpro compiler version 6.0.2. Two compile
options were used. The first was -O2, which turns on extensive optimization. These
optimizations are conservative in that they almost always provide some speedup and maintain
floating-point accuracy. The second was -O3, which is aggressive optimization. The main
consequence of -O3 optimization is that it turns on software pipelining. The code scheduler
attempts to pipeline innermost loops whenever possible.

A  discovery  made  halfway  through  these  trials  led  to  a  small  change  in  the  analysis  of  the 
results.  In  version  6.0.1  of  the  compiler  system  (running  at  the  University  of  Delaware),  the 
pipeline  scheduler  would  give  up  if  it  could  not  generate  a  schedule  without  register  spilling. 
The  newer  version  of  the  compiler  (running  at  the  Army  Research  Laboratory),  however, 
will  still  schedule  pipelined  loops  with  spill  code  introduced.  The  general  hypothesis  remains 
the  same.  The  new  twist  is  that  spilling  will  limit  pipelining  usefulness. 

3  Results 

3.1  Loop  Unrolling 

Loop unrolling replicates the body of a loop some number of times known as the unrolling
factor. Loop unrolling can increase performance in two ways. First, it reduces
loop overhead by executing fewer compare and branch instructions. Second, it increases the work
performed in the resulting larger loop body, which gives more opportunity for optimization
and register reuse. Most of the speedup on the SGI comes from overlapping
multiplication and addition instructions within the same multiple-issue cycle.
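As a minimal sketch of the transformation itself (not one of the test codes used in this study), the following C fragment shows a multiply-add loop in its rolled form and unrolled by a factor of four. The function names axpy_rolled and axpy_unrolled4 are hypothetical and chosen only for illustration. The cleanup loop handles trip counts that are not a multiple of the unrolling factor.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: one compare-and-branch per multiply-add. */
void axpy_rolled(size_t n, double a, const double *x, double *y) {
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}

/* Unrolled by a factor of 4: one compare-and-branch per four
   multiply-adds, and four independent statements per iteration
   that a scheduler is free to overlap. */
void axpy_unrolled4(size_t n, double a, const double *x, double *y) {
    assert(n == 0 || (x != NULL && y != NULL));
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        y[i + 0] = y[i + 0] + a * x[i + 0];
        y[i + 1] = y[i + 1] + a * x[i + 1];
        y[i + 2] = y[i + 2] + a * x[i + 2];
        y[i + 3] = y[i + 3] + a * x[i + 3];
    }
    /* Cleanup loop for the remaining 0 to 3 elements. */
    for (; i < n; i++)
        y[i] = y[i] + a * x[i];
}
```

Both functions compute the same result. The unrolled form simply exposes more independent work per branch, at the cost of needing more registers to hold live values.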

A simple 2-D matrix multiply code fragment was used to test unrolling effects on the
R8000 processor. This code is listed in Appendix A. Four versions of the matrix
multiply were implemented in different functions. There is a caveat. The author does not
claim  this  code  to  be  the  best  version  of  matrix  multiply  possible.  Simply,  the  base  version 
is  straightforward  and  provides  a  good  example  of  unrolling  for  memory  locality.  Other 
C  codes  with  loop  reordering  and  splitting  will  undoubtedly  come  closer  in  reaching  near- 
theoretical  peak  on  the  SGI  architecture  than  these  versions.  The  function  MM_basic  is  the 
basic  matrix  multiply  loop.  The  optimizer  unrolled  the  inner  loop  four  times  when  this  was 
compiled.  The  hand-coded  unrolling  of  the  other  functions  was  performed  on  the  outer  and 
middle  loop  nests.  The  exact  unrolling  can  be  seen  in  the  code  listing  in  Appendix  A.  The 
optimizer  did  not  unroll  the  inner  loop  in  these  cases. 
Various data were collected during program execution. The results are displayed in Table
1. Column one lists the function name. Columns "-O2" and "-O3" list the run times of the
code compiled with the two flags, respectively. The rest of the columns pertain only to the
executable compiled with -O3 optimization. "Cycles/Iteration" lists how many machine
cycles were required to perform one iteration of the original (not unrolled) inner loop. For instance,
in MM_unroll_2, both the outer and middle loops were unrolled four times. The inner loop body,
therefore, performs the work of 16 original iterations each time it is executed. The scheduler reported
this loop to be pipelined with a steady state of nine cycles per iteration. In this case, the
number reported in the table is derived from dividing the steady-state number of cycles by
the total number of computations performed by one iteration of the inner loop. FLOPS
gives the compiler-calculated rate of floating-point operations as a percentage of peak, based on the MIPS
R8000's ability to perform two such operations per cycle. Memory lists the percentage of
peak memory references achieved. The maximum is two each cycle.

Table 1: Matrix multiply performance and other statistics.

              Run Time (sec)      -O3 Compiled Executable
Function      -O2      -O3        Cycles/Iteration  FLOPS (%)  Memory (%)
MM_basic      94.37    91.84      0.67              33         100
MM_unroll_1   20.17    14.59      0.58              85         71
MM_unroll_2   13.38    10.45      0.56              88         66
MM_unroll_3   14.23    14.28      0.67              74         44
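The derived columns can be cross-checked with a little arithmetic. The sketch below is an illustration only: the helper names are hypothetical, and the use of integer truncation for the percentages is an assumption that happens to reproduce the figures the scheduler reports for MM_unroll_3 (27 cycles, 40 multiply-adds, 24 memory references).

```c
#include <assert.h>

/* Steady-state cycles divided by the number of computations that
   one iteration of the (unrolled) inner loop performs. */
double cycles_per_computation(int cycles, int computations) {
    assert(computations > 0);
    return (double)cycles / computations;
}

/* Percentage of peak for a resource that can issue per_cycle_peak
   operations each cycle. Truncation (rather than rounding) is an
   assumption made to match the reported values. */
int percent_of_peak(int ops, int cycles, int per_cycle_peak) {
    assert(cycles > 0 && per_cycle_peak > 0);
    return (100 * ops) / (cycles * per_cycle_peak);
}
```

For example, cycles_per_computation(9, 16) gives 0.5625 for MM_unroll_2, which the table lists as 0.56, while percent_of_peak(40, 27, 2) and percent_of_peak(24, 27, 2) give 74 and 44 for MM_unroll_3.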

As  evident  from  the  timing  profiles,  the  function  MM_basic  is  perhaps  the  worst  way 
of  performing  a  matrix  multiply.  This  poor  performance  results  from  the  inefficient  way 
in  which  memory  is  being  utilized.  The  best  way  to  check  on  memory  performance  is 
through profiling. Two profiling mechanisms are available on the SGI operating system:
prof and pixie. Comparison of their outputs tells on a procedure-by-procedure basis how
well the memory system is performing. Prof uses program counter sampling to collect data.
It  interrupts  the  code  periodically  and  records  the  location  of  the  program  counter.  The 
condensed  prof  output  is  listed  as  follows: 


samples   time(%)            cum time(%)        procedure (file)

28046     2.8e+02s( 34.1)    2.8e+02s( 34.1)    MM_basic
18403     1.8e+02s( 22.4)    4.6e+02s( 56.6)    MM_unroll_1
18236     1.8e+02s( 22.2)    6.5e+02s( 78.8)    MM_unroll_2
17263     1.7e+02s( 21.0)    8.2e+02s( 99.8)    MM_unroll_3


In  contrast,  pixie  instruments  the  code  with  counters  at  the  beginning  and  end  of  basic 
blocks.  It  counts  only  the  number  of  cycles  the  program  executes  and  does  not  account  for 
cache  misses,  bank  conflicts,  etc.  The  abbreviated  pixie  output  is  given  as  follows: 


cycles(%)             cum %    secs      instrns        calls  procedure (file)

14883848018(28.12)    28.12    198.45    29765772028    1      MM_basic
12760324018(24.10)    52.22    170.14    26840486028    1      MM_unroll_1
12630242018(23.86)    76.08    168.40    26610363028    1      MM_unroll_2
12540096818(23.69)    99.77    167.20    26484145228    1      MM_unroll_3
Optimizers can affect the accuracy of profiling. Therefore, the profiled executables were
created  with  optimizations  disabled. 

In the best case, pixie is reporting that MM_basic should complete in about 198 seconds.
Prof  is  showing  that  it  is  taking  about  280  seconds.  This  shows  that  the  current  structure 
of  the  code  is  not  working  well  with  the  memory  system.  The  unrolled  code  fragments 
dramatically  illustrate  the  advantages  of  loop  unrolling.  In  these  cases,  loop  unrolling  was 
the  means  to  achieve  register  (or  loop)  blocking.  By  unrolling  the  various  loops,  loads  and 
stores  for  several  array  elements  were  highly  reduced.  The  memory  system  performed  much 
better,  as  evident  from  the  run  time  as  well  as  the  closely  matched  times  given  in  the  prof 
and  pixie  profiles. 
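The comparison between the two profiles can be sketched as a short calculation. This is an illustration under stated assumptions, not part of the original tooling: the helper names are hypothetical, and the 75-MHz clock rate is taken from the methodology section. The pixie cycle count divided by the clock rate gives an ideal time with a perfect memory system, and the ratio of the sampled prof time to that ideal time indicates how much the memory system is slowing the procedure down.

```c
#include <assert.h>

#define CLOCK_HZ 75.0e6   /* 75-MHz R8000, as stated in Section 2 */

/* Ideal run time implied by a pixie cycle count. */
double ideal_seconds(double cycles) {
    assert(cycles >= 0.0);
    return cycles / CLOCK_HZ;
}

/* Ratio of the sampled (prof) time to the ideal (pixie) time.
   Values near 1.0 suggest the memory system is keeping up. */
double memory_overhead_ratio(double sampled_seconds, double cycles) {
    assert(cycles > 0.0);
    return sampled_seconds / ideal_seconds(cycles);
}
```

With the MM_basic figures above, ideal_seconds(14883848018.0) is about 198.45 seconds, and memory_overhead_ratio(280.0, 14883848018.0) is roughly 1.4, meaning the procedure takes about 40% longer than its instruction count alone would predict.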

Getting the maximum benefits from a compiler usually requires having a detailed
knowledge of the many optional flags to control the fine points in the compiling process. The
MIPSpro compiler is no different. With standard options, the compiler could not pipeline
the loop body for MM_unroll_3 because the loop body was too long. The compile option
-SWP:body_ins=250 was used to increase the maximum size of a loop body that would be
considered for software pipelining.

Loop unrolling led to great speed increases. Unrolling with pipelining allowed the basic
matrix multiply to execute at 33% efficiency. Without unrolling, efficiency is only around
10% of the maximum throughput. Loop unrolling with the goal of register blocking achieved
even greater results. The software pipeliner, which allows differing loop iterations to overlap,
was able to achieve speedup over standard -O2 optimization in almost every case.

Unrolling does reach a point of maximum usefulness in these test cases. With each
function, more and more unrolling was done in order to promote register reuse and
instruction-level parallelism. MM_unroll_2 has extensive unrolling but does not produce any spill code.
The implementation of MM_unroll_3, however, produces extensive spilling. A quick check of the
statistics reported in Table 1 shows that the point has been reached at which
unrolling is harming execution speed. For MM_unroll_3, the cycles/iteration is higher and
the FLOP rate is lower than the figures reported for MM_unroll_2. The actual output
from the scheduler is listed next.¹


#<swps>  Not unrolled before pipelining
#<swps>    27 cycles per iteration
#<swps>    40 flops        ( 74% of peak) (madds count as 1)
#<swps>    40 madds        ( 74% of peak)
#<swps>    24 mem refs     ( 44% of peak)
#<swps>     3 integer ops  (  5% of peak)
#<swps>    67 instructions ( 62% of peak)
#<swps>    32 fgr registers used.
#<swps>    29 restores introduced.
#<swps>    14 possible stall cycles
#<swps>    14 min possible stall cycles


As  expected,  register  spilling  does  hurt  the  speed  of  execution  in  this  case.  A  massive 
amount  of  spills  and  restores  has  been  added  by  the  scheduler,  and  even  pipelining  cannot 
hide  the  resultant  delays. 

In the pipeline message, there is a statement saying 14 possible stall cycles may exist.
Most stalls on this processor occur due to the floating-point unit and the integer unit of
the CPU becoming unsynchronized. Several factors may lead to this occurrence. Indirect
addressing, such as a[b[i]], requires the lookup of b[i] to complete before the load/store
can begin. Multidimensional arrays, prevalent in these examples, lead to similar problems.
These integer unit operations, paired with the many floating-point multiplication
operations, may be the reason the compiler is warning of worst-case synchronization stalls.
How much of the degradation in MM_unroll_3 is attributable to stall cycles and how much
is attributable to spilling is hard to determine.

¹ Complete pipeliner messages for the example codes are listed in Appendices C and D.

3.2  Loop  Fusion 

Loop fusion is a process where two or more adjacent loops are merged into a single loop.
Loop fusion has the potential to increase performance by reducing loop overhead and
increasing instruction-level parallelism.
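A minimal sketch of the transformation, separate from the test code in Appendix B, is given below. The function and parameter names are hypothetical. Two adjacent loops over the same index range are merged into one, which removes one set of compare and branch instructions and lets the intermediate value t[i] be consumed immediately after it is produced.

```c
#include <assert.h>
#include <stddef.h>

/* Two adjacent loops: t[] is written by the first loop and read
   back by the second. */
void not_fused(size_t n, const double *x, const double *y,
               double *t, double *out) {
    assert(n == 0 || (x && y && t && out));
    for (size_t i = 0; i < n; i++)
        t[i] = x[i] - y[i];
    for (size_t i = 0; i < n; i++)
        out[i] = t[i] + x[i];
}

/* Fused form: one loop, one compare-and-branch per element, and
   t[i] can stay in a register between its definition and its use.
   The larger body also keeps more values live at once. */
void fused(size_t n, const double *x, const double *y,
           double *t, double *out) {
    assert(n == 0 || (x && y && t && out));
    for (size_t i = 0; i < n; i++) {
        t[i] = x[i] - y[i];
        out[i] = t[i] + x[i];
    }
}
```

Fusion is legal here because the second loop reads only the element of t that the same iteration of the first loop wrote. As the loop body grows, so does the number of simultaneously live values, which is the source of the register pressure examined in this section.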

A somewhat contrived example was used to test fusion on the R8000 processor. The
loops were deliberately designed to give variables long live ranges and hence to make it
as difficult as possible for the scheduler to achieve scheduling, not to mention pipelining,
without introduction of some spill code. The code is listed in Appendix B. The loops in the
NotFused function are named loop1, loop2, and loop3. Table 2 lists the execution results.

Table 2: Loop fusion performance and other statistics.

              Run Time (sec)      -O3 Compiled Executable
Function      -O2      -O3        Cycles/Iteration  FLOPS (%)  Memory (%)
NotFused      0.97     0.81
  loop1                           18.0              33         100
  loop2                           11.0              68         95
  loop3                           11.0              68         95
Fused         0.67     1.26       48.0              43         53

If the loops are not pipelined, the fused loop does indeed outperform the three separate
loops. The multiple compare and branch instructions executed in the three loops can be
extremely costly because they often interfere with maximum instruction issue per cycle. In
this case, the reduction in loop overhead increased code speed by a factor of about 1.4
(0.97 seconds versus 0.67 seconds at -O2). The fused loops did
create a small amount of spill code, but the effects seem to be negligible compared to cycles
lost on compare and branch instructions.

For  pipelining,  however,  the  spill  code  seemed  to  cause  a  greater  problem.  The  pipeliner 
reported  numerous  potential  problems: 


#<swps>  Not unrolled before pipelining
#<swps>    48 cycles per iteration
#<swps>    42 flops        ( 43% of peak) (madds count as 1)
#<swps>     0 madds        (  0% of peak)
#<swps>    51 mem refs     ( 53% of peak)
#<swps>     6 integer ops  (  6% of peak)
#<swps>    99 instructions ( 51% of peak)
#<swps>    32 fgr registers used.
#<swps>     3 spills 5 restores introduced.
#<swps>    25 possible stall cycles
#<swps>    11 min possible stall cycles
#<swps>    26 min cycles required for resources
#<swps>    48 cycle schedule register allocated.
#<swps>    30 min cycles required for resources with additional memory refs.
#<swps>    30 min cycles required for recurrences with additional memory refs.

Stalls are once again present, and this time there is a warning about the number of cycles
required to deal with resources and recurrences with memory references. All of these
problems seem to have a very bad cumulative effect on the final performance. This code was not
written  with  any  regard  to  memory  locality.  The  three  loops  taken  separately  and  pipelined 
performed  fairly  well,  but  could  still  not  perform  as  well  as  the  fused,  nonpipelined  loop. 
The  spill  code  in  the  pipelined  fused  loop  has  put  extreme  burdens  on  the  memory  system 
and  has  caused  a  severe  loss  of  performance. 

4  Conclusion 

To  be  of  maximum  usefulness,  the  scheduler  of  a  compiler  must  be  able  to  fully  take  into 
account  the  extremely  complicated  memory  systems  in  most  of  today’s  shared-memory, 
high-performance  computers.  As  has  been  shown  from  these  examples,  transformations 
to  increase  memory  locality  as  well  as  reduce  loop  overhead  and  promote  instruction-level 
parallelism can be extremely advantageous. They are, however, extremely interrelated, and
promoting one often comes at the expense of the other.

Is there a best choice for ordering these transformations or some way of knowing how
much of one to perform? Building such knowledge into the compiler will be very difficult.
Optimal scheduling is itself an NP-complete problem, and predicting memory system
behavior is difficult. Building extensive information about the memory system into the compiler
will undoubtedly greatly increase compile time and, given the nondeterministic behavior of
the memory system, may still not be totally accurate. Those codes worthy
of extensive analysis and optimization will probably be best served by compilers that
generate detailed messages about the actions they took, allowing the programmer to
make more informed choices about optimizing source-level code structure. It seems that
only through profiling and modifying code by hand can maximum performance be achieved
on a per-architecture basis. Some general conclusions are noteworthy, however:

•  Loop  unrolling  is  very  efficient  at  promoting  instruction-level  parallelism. 

•  Loop  fusion  is  very  efficient  at  removing  costly  compare  and  branch  instructions  and 
may  be  more  efficient  than  pipelining  in  some  cases. 

•  Large  loop  bodies  with  somewhat  random  or  erratic  memory  access  patterns  will 
seldom  benefit  from  pipelining.  These  loops  will  either  be  better  off  not  pipelined  or 
distributed  and  then  pipelined  if  possible. 

•  Codes  written  that  take  into  account  the  memory  system  should  in  most  cases  benefit 
from  pipelining. 

•  Loop  unrolling  to  promote  register  reuse  is  only  efficient  to  just  prior  to  the  point 
where  spill  code  must  be  introduced. 
A  Loop Unrolling Examples


#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N 1000
#define L 480
#define M 1000

double a[M][L], b[L][N], c[M][N];

/* StartTimer() and StopTimer() are timing helpers whose definitions
   are not included in this listing. */

void init() {
    int i, j, k;

    for (i=0; i<M; i++)
        for (k=0; k<L; k++)
            a[i][k] = drand48();

    for (k=0; k<L; k++)
        for (j=0; j<N; j++)
            b[k][j] = drand48();

    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            c[i][j] = 0.0;
}


void MM_basic() {
    int i, j, k;

    StartTimer();

    for (j=0; j<N; j++)
        for (k=0; k<L; k++)
            for (i=0; i<M; i++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];

    StopTimer();

    printf("%f\n", c[1][1]);
}


void MM_unroll_1() {
    int i, j, k;

    StartTimer();

    for (j=0; j<N; j+=2)
        for (k=0; k<L; k+=6)
            for (i=0; i<M; i++) {
                c[i][j+0] = c[i][j+0] + a[i][k+0] * b[k+0][j+0];
                c[i][j+0] = c[i][j+0] + a[i][k+1] * b[k+1][j+0];
                c[i][j+0] = c[i][j+0] + a[i][k+2] * b[k+2][j+0];
                c[i][j+0] = c[i][j+0] + a[i][k+3] * b[k+3][j+0];
                c[i][j+0] = c[i][j+0] + a[i][k+4] * b[k+4][j+0];
                c[i][j+0] = c[i][j+0] + a[i][k+5] * b[k+5][j+0];

                c[i][j+1] = c[i][j+1] + a[i][k+0] * b[k+0][j+1];
                c[i][j+1] = c[i][j+1] + a[i][k+1] * b[k+1][j+1];
                c[i][j+1] = c[i][j+1] + a[i][k+2] * b[k+2][j+1];
                c[i][j+1] = c[i][j+1] + a[i][k+3] * b[k+3][j+1];
                c[i][j+1] = c[i][j+1] + a[i][k+4] * b[k+4][j+1];
                c[i][j+1] = c[i][j+1] + a[i][k+5] * b[k+5][j+1];
            }

    StopTimer();

    printf("%f\n", c[1][1]);
}


void MM_unroll_2() {
    int i, j, k;

    StartTimer();

    for (j=0; j<N; j+=4)
        for (k=0; k<L; k+=4)
            for (i=0; i<M; i++) {
                c[i][j+0] = c[i][j+0] + a[i][k+0] * b[k+0][j+0];
                c[i][j+0] = c[i][j+0] + a[i][k+1] * b[k+1][j+0];
                c[i][j+0] = c[i][j+0] + a[i][k+2] * b[k+2][j+0];
                c[i][j+0] = c[i][j+0] + a[i][k+3] * b[k+3][j+0];

                c[i][j+1] = c[i][j+1] + a[i][k+0] * b[k+0][j+1];
                c[i][j+1] = c[i][j+1] + a[i][k+1] * b[k+1][j+1];
                c[i][j+1] = c[i][j+1] + a[i][k+2] * b[k+2][j+1];
                c[i][j+1] = c[i][j+1] + a[i][k+3] * b[k+3][j+1];

                c[i][j+2] = c[i][j+2] + a[i][k+0] * b[k+0][j+2];
                c[i][j+2] = c[i][j+2] + a[i][k+1] * b[k+1][j+2];
                c[i][j+2] = c[i][j+2] + a[i][k+2] * b[k+2][j+2];
                c[i][j+2] = c[i][j+2] + a[i][k+3] * b[k+3][j+2];

                c[i][j+3] = c[i][j+3] + a[i][k+0] * b[k+0][j+3];
                c[i][j+3] = c[i][j+3] + a[i][k+1] * b[k+1][j+3];
                c[i][j+3] = c[i][j+3] + a[i][k+2] * b[k+2][j+3];
                c[i][j+3] = c[i][j+3] + a[i][k+3] * b[k+3][j+3];
            }

    StopTimer();

    printf("%f\n", c[1][1]);
}


void MM_unroll_3() {
    int i, j, k;

    StartTimer();

    for (j=0; j<N; j+=10)
        for (k=0; k<L; k+=4)
            for (i=0; i<M; i++) {
                c[i][j+0] = c[i][j+0] + a[i][k+0] * b[k+0][j+0];
                c[i][j+0] = c[i][j+0] + a[i][k+1] * b[k+1][j+0];
                c[i][j+0] = c[i][j+0] + a[i][k+2] * b[k+2][j+0];
                c[i][j+0] = c[i][j+0] + a[i][k+3] * b[k+3][j+0];

                c[i][j+1] = c[i][j+1] + a[i][k+0] * b[k+0][j+1];
                c[i][j+1] = c[i][j+1] + a[i][k+1] * b[k+1][j+1];
                c[i][j+1] = c[i][j+1] + a[i][k+2] * b[k+2][j+1];
                c[i][j+1] = c[i][j+1] + a[i][k+3] * b[k+3][j+1];

                c[i][j+2] = c[i][j+2] + a[i][k+0] * b[k+0][j+2];
                c[i][j+2] = c[i][j+2] + a[i][k+1] * b[k+1][j+2];
                c[i][j+2] = c[i][j+2] + a[i][k+2] * b[k+2][j+2];
                c[i][j+2] = c[i][j+2] + a[i][k+3] * b[k+3][j+2];

                c[i][j+3] = c[i][j+3] + a[i][k+0] * b[k+0][j+3];
                c[i][j+3] = c[i][j+3] + a[i][k+1] * b[k+1][j+3];
                c[i][j+3] = c[i][j+3] + a[i][k+2] * b[k+2][j+3];
                c[i][j+3] = c[i][j+3] + a[i][k+3] * b[k+3][j+3];

                c[i][j+4] = c[i][j+4] + a[i][k+0] * b[k+0][j+4];
                c[i][j+4] = c[i][j+4] + a[i][k+1] * b[k+1][j+4];
                c[i][j+4] = c[i][j+4] + a[i][k+2] * b[k+2][j+4];
                c[i][j+4] = c[i][j+4] + a[i][k+3] * b[k+3][j+4];

                c[i][j+5] = c[i][j+5] + a[i][k+0] * b[k+0][j+5];
                c[i][j+5] = c[i][j+5] + a[i][k+1] * b[k+1][j+5];
                c[i][j+5] = c[i][j+5] + a[i][k+2] * b[k+2][j+5];
                c[i][j+5] = c[i][j+5] + a[i][k+3] * b[k+3][j+5];

                c[i][j+6] = c[i][j+6] + a[i][k+0] * b[k+0][j+6];
                c[i][j+6] = c[i][j+6] + a[i][k+1] * b[k+1][j+6];
                c[i][j+6] = c[i][j+6] + a[i][k+2] * b[k+2][j+6];
                c[i][j+6] = c[i][j+6] + a[i][k+3] * b[k+3][j+6];

                c[i][j+7] = c[i][j+7] + a[i][k+0] * b[k+0][j+7];
                c[i][j+7] = c[i][j+7] + a[i][k+1] * b[k+1][j+7];
                c[i][j+7] = c[i][j+7] + a[i][k+2] * b[k+2][j+7];
                c[i][j+7] = c[i][j+7] + a[i][k+3] * b[k+3][j+7];

                c[i][j+8] = c[i][j+8] + a[i][k+0] * b[k+0][j+8];
                c[i][j+8] = c[i][j+8] + a[i][k+1] * b[k+1][j+8];
                c[i][j+8] = c[i][j+8] + a[i][k+2] * b[k+2][j+8];
                c[i][j+8] = c[i][j+8] + a[i][k+3] * b[k+3][j+8];

                c[i][j+9] = c[i][j+9] + a[i][k+0] * b[k+0][j+9];
                c[i][j+9] = c[i][j+9] + a[i][k+1] * b[k+1][j+9];
                c[i][j+9] = c[i][j+9] + a[i][k+2] * b[k+2][j+9];
                c[i][j+9] = c[i][j+9] + a[i][k+3] * b[k+3][j+9];
            }

    StopTimer();

    printf("%f\n", c[1][1]);
}

int main(void) {
    init();

    MM_basic();
    MM_unroll_1();
    MM_unroll_2();
    MM_unroll_3();

    return 0;
}
B  Loop Fusion Examples


#include <stdio.h>

#define N 1000
#define M 1000

double x1[N];
double x2[N];
double x3[N];
double x4[N];
double x5[N];
double x6[N];
double x7[N];
double x8[N];
double x9[N];
double x10[N];
double x11[N];
double x12[N];

double y1[N];
double y2[N];
double y3[N];
double y4[N];
double y5[N];
double y6[N];
double y7[N];
double y8[N];
double y9[N];
double y10[N];
double y11[N];
double y12[N];

double z1[N];
double z2[N];
double z3[N];
double z4[N];
double z5[N];
double z6[N];
double z7[N];
double z8[N];
double z9[N];
double z10[N];
double z11[N];
double z12[N];

double a1[N];
double a2[N];
double a3[N];
double a4[N];
double a5[N];
double a6[N];
double a7[N];
double a8[N];
double a9[N];

double b1[N];
double b2[N];
double b3[N];
double b4[N];
double b5[N];
double b6[N];

double c1[N];
double c2[N];
double c3[N];

void NotFused() {
    int i, j;

    StartTimer();

    for (j=0; j<M; j++) {

        /* loop1 */
        for (i=0; i<N; i++) {
            x9[i]  = x8[i] - x1[i];
            x10[i] = x7[i] - x2[i];
            x11[i] = x6[i] - x3[i];
            x12[i] = x5[i] - x4[i];
            y9[i]  = y8[i] - y1[i];
            y10[i] = y7[i] - y2[i];
            y11[i] = y6[i] - y3[i];
            y12[i] = y5[i] - y4[i];
            z9[i]  = z8[i] - z1[i];
            z10[i] = z7[i] - z2[i];
            z11[i] = z6[i] - z3[i];
            z12[i] = z5[i] - z4[i];
        }

        /* loop2 */
        for (i=0; i<N; i++) {
            a1[i] = x9[i] + x10[i] + x11[i] + x12[i];
            a2[i] = y9[i] + y10[i] + y11[i] + y12[i];
            a3[i] = z9[i] + z10[i] + z11[i] + z12[i];
            a4[i] = x9[i] + x12[i];
            a5[i] = x10[i] + x11[i];
            a6[i] = y9[i] + y12[i];
            a7[i] = y10[i] + y11[i];
            a8[i] = z9[i] + z12[i];
            a9[i] = z10[i] + z11[i];
        }

        /* loop3 */
        for (i=0; i<N; i++) {
            b1[i] = a1[i] + a2[i] + x1[i] + x2[i] + x7[i];
            b2[i] = a2[i] + a3[i] + x3[i] + x4[i] + y7[i];
            b3[i] = a3[i] + a1[i] + x5[i] + x6[i] + z7[i];
            c1[i] = y1[i] + a1[i];
            c2[i] = y2[i] + a2[i];
            c3[i] = y3[i] + a3[i];
        }
    }

    StopTimer();
    printf("%f\n", c3[1]);
}   /* end NotFused */


void  Fused ()  { 
int  i,  j; 


StaurtTimerO ; 
for  (j=0;  j<M;  j-H-) 


14 


        for (i=0; i<N; i++) {
            x9[i]  = x8[i] - x1[i];
            x10[i] = x7[i] - x2[i];
            x11[i] = x6[i] - x3[i];
            x12[i] = x5[i] - x4[i];
            y9[i]  = y8[i] - y1[i];
            y10[i] = y7[i] - y2[i];
            y11[i] = y6[i] - y3[i];
            y12[i] = y5[i] - y4[i];
            z9[i]  = z8[i] - z1[i];
            z10[i] = z7[i] - z2[i];
            z11[i] = z6[i] - z3[i];
            z12[i] = z5[i] - z4[i];
            a1[i] = x9[i] + x10[i] + x11[i] + x12[i];
            a2[i] = y9[i] + y10[i] + y11[i] + y12[i];
            a3[i] = z9[i] + z10[i] + z11[i] + z12[i];
            a4[i] = x9[i] + x12[i];
            a5[i] = x10[i] + x11[i];
            a6[i] = y9[i] + y12[i];
            a7[i] = y10[i] + y11[i];
            a8[i] = z9[i] + z12[i];
            a9[i] = z10[i] + z11[i];
            b1[i] = a1[i] + a2[i] + x1[i] + x2[i] + x7[i];
            b2[i] = a2[i] + a3[i] + x3[i] + x4[i] + y7[i];
            b3[i] = a3[i] + a1[i] + x5[i] + x6[i] + z7[i];
            c1[i] = y1[i] + a1[i];
            c2[i] = y2[i] + a2[i];
            c3[i] = y3[i] + a3[i];
        }


    StopTimer();
    printf("%f\n", c3[1]);
} /* end Fused */

main() {
    NotFused();
    Fused();
} /* end main */
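
The listing above times the two versions but does not compare their results. The following is a minimal sketch, on a reduced two-statement kernel, of how one might check that a fused loop computes the same values as the separate loops it replaces. The function names, the array names p, q, d, s, and the bound MAX_N are hypothetical and are not part of the benchmark; the inputs are chosen to be exactly representable so that the comparison can be exact.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_N 64  /* hypothetical bound for this sketch only */

/* Two separate loops: d[] is written by the first loop and read by the second. */
static void diff_then_sum_unfused(const double *p, const double *q,
                                  double *d, double *s, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        d[i] = p[i] - q[i];
    for (i = 0; i < n; i++)
        s[i] = d[i] + p[i];
}

/* Fused form: d[i] is produced and consumed in the same iteration. */
static void diff_then_sum_fused(const double *p, const double *q,
                                double *d, double *s, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        d[i] = p[i] - q[i];
        s[i] = d[i] + p[i];
    }
}

/* Returns 1 when both forms produce identical d[] and s[] for n elements. */
int fusion_preserves_results(size_t n)
{
    double p[MAX_N], q[MAX_N];
    double d1[MAX_N], s1[MAX_N], d2[MAX_N], s2[MAX_N];
    size_t i;

    assert(n <= MAX_N);

    /* Small multiples of 0.5: every operation below is exact in double. */
    for (i = 0; i < n; i++) {
        p[i] = (double)i * 1.5;
        q[i] = (double)(n - i);
    }

    diff_then_sum_unfused(p, q, d1, s1, n);
    diff_then_sum_fused(p, q, d2, s2, n);

    for (i = 0; i < n; i++)
        if (d1[i] != d2[i] || s1[i] != s2[i])
            return 0;
    return 1;
}
```

Fusion is legal here because iteration i of the second loop depends only on the value the first loop wrote for the same i, which is also the dependence pattern between loop1, loop2, and loop3 in the benchmark.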
C  Loop Unrolling Pipeline Success Messages


#<swps>
#<swps> Pipelined loop line 23 steady state
#<swps>
#<swps>  4 unrollings before pipelining
#<swps>  2 cycles per 4 iterations
#<swps>  0 flops (  0% of peak) (madds count as 2)
#<swps>  0 flops (  0% of peak) (madds count as 1)
#<swps>  0 madds (  0% of peak)
#<swps>  4 mem refs (100% of peak)
#<swps>  2 integer ops ( 50% of peak)
#<swps>  6 instructions ( 75% of peak)
#<swps>  0 short trip threshold
#<swps>  3 ireg registers used.
#<swps>  1 fgr register used.
#<swps>

#<swps>
#<swps> Pipelined loop line 36 steady state
#<swps>
#<swps>  4 unrollings before pipelining
#<swps>  6 cycles per 4 iterations
#<swps>  8 flops ( 33% of peak) (madds count as 2)
#<swps>  4 flops ( 33% of peak) (madds count as 1)
#<swps>  4 madds ( 33% of peak)
#<swps> 12 mem refs (100% of peak)
#<swps>  3 integer ops ( 25% of peak)
#<swps> 19 instructions ( 79% of peak)
#<swps>  2 short trip threshold
#<swps>  7 ireg registers used.
#<swps> 14 fgr registers used.
#<swps>
#<swps>  6 possible stall cycles
#<swps>  6 min possible stall cycles
#<swps>

#<swps>
#<swps> Pipelined loop line 53 steady state
#<swps>
#<swps> Not unrolled before pipelining
#<swps>  7 cycles per iteration
#<swps> 24 flops ( 85% of peak) (madds count as 2)
#<swps> 12 flops ( 85% of peak) (madds count as 1)
#<swps> 12 madds ( 85% of peak)
#<swps> 10 mem refs ( 71% of peak)
#<swps>  3 integer ops ( 21% of peak)
#<swps> 25 instructions ( 89% of peak)
#<swps>  4 short trip threshold
#<swps> 11 ireg registers used.
#<swps> 27 fgr registers used.
#<swps>

#<swps>
#<swps> Pipelined loop line 84 steady state
#<swps>
#<swps> Not unrolled before pipelining
#<swps>  9 cycles per iteration
#<swps> 32 flops ( 88% of peak) (madds count as 2)
#<swps> 16 flops ( 88% of peak) (madds count as 1)
#<swps> 16 madds ( 88% of peak)
#<swps> 12 mem refs ( 66% of peak)
#<swps>  3 integer ops ( 16% of peak)
#<swps> 31 instructions ( 86% of peak)
#<swps>  2 short trip threshold
#<swps>  7 ireg registers used.
#<swps> 32 fgr registers used.
#<swps>
#<swps>  8 min cycles required for resources
#<swps>  9 cycle schedule register allocated.
#<swps>

#<swps>
#<swps> Pipelined loop line 120 steady state
#<swps>
#<swps> Not unrolled before pipelining
#<swps> 27 cycles per iteration
#<swps> 80 flops ( 74% of peak) (madds count as 2)
#<swps> 40 flops ( 74% of peak) (madds count as 1)
#<swps> 40 madds ( 74% of peak)
#<swps> 24 mem refs ( 44% of peak)
#<swps>  3 integer ops (  5% of peak)
#<swps> 67 instructions ( 62% of peak)
#<swps>  2 short trip threshold
#<swps>  7 ireg registers used.
#<swps> 32 fgr registers used.
#<swps>
#<swps> 29 restores introduced.
#<swps> 14 possible stall cycles
#<swps> 14 min possible stall cycles
#<swps>
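
The "% of peak" figures in the reports above can be reproduced from the operation counts and the cycles per schedule. The sketch below assumes per-cycle peaks of 4 instructions, 2 memory references, 2 integer operations, and 4 flops (with madds counted as 2), and assumes that the compiler truncates rather than rounds. Both assumptions are inferred from the numbers in the listings and are not stated in the source; the names percent_of_peak and PEAK_* are hypothetical.

```c
#include <assert.h>

/* Per-cycle peak rates inferred from the reports (an assumption). */
enum {
    PEAK_INSTRUCTIONS = 4,
    PEAK_MEM_REFS     = 2,
    PEAK_INTEGER_OPS  = 2,
    PEAK_FLOPS_MADD2  = 4   /* flops per cycle when a madd counts as 2 */
};

/*
 * Percentage of peak for `count` operations scheduled in `cycles` cycles,
 * given a peak of `peak_per_cycle` operations per cycle.
 * Integer division truncates, which matches the reported figures.
 */
int percent_of_peak(int count, int cycles, int peak_per_cycle)
{
    assert(count >= 0 && cycles > 0 && peak_per_cycle > 0);
    return (100 * count) / (cycles * peak_per_cycle);
}
```

For the loop at line 120, for example, 67 instructions in a 27-cycle schedule give 6700 / 108, which truncates to the reported 62%.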
D  Loop Fusion Pipeline Success Messages


#<swps>
#<swps> Pipelined loop line 75 steady state
#<swps>
#<swps> Not unrolled before pipelining
#<swps> 18 cycles per iteration
#<swps> 12 flops ( 16% of peak) (madds count as 2)
#<swps> 12 flops ( 33% of peak) (madds count as 1)
#<swps>  0 madds (  0% of peak)
#<swps> 36 mem refs (100% of peak)
#<swps>  5 integer ops ( 13% of peak)
#<swps> 53 instructions ( 73% of peak)
#<swps>  1 short trip threshold
#<swps>  9 ireg registers used.
#<swps> 21 fgr registers used.
#<swps>
#<swps> 18 possible stall cycles
#<swps> 18 min possible stall cycles
#<swps>

#<swps>
#<swps> Pipelined loop line 90 steady state
#<swps>
#<swps> Not unrolled before pipelining
#<swps> 11 cycles per iteration
#<swps> 15 flops ( 34% of peak) (madds count as 2)
#<swps> 15 flops ( 68% of peak) (madds count as 1)
#<swps>  0 madds (  0% of peak)
#<swps> 21 mem refs ( 95% of peak)
#<swps>  2 integer ops (  9% of peak)
#<swps> 38 instructions ( 86% of peak)
#<swps>  3 short trip threshold
#<swps> 17 ireg registers used.
#<swps> 29 fgr registers used.
#<swps>
#<swps> 10 possible stall cycles
#<swps> 10 min possible stall cycles
#<swps>

#<swps>
#<swps> Pipelined loop line 102 steady state
#<swps>
#<swps> Not unrolled before pipelining
#<swps> 11 cycles per iteration
#<swps> 15 flops ( 34% of peak) (madds count as 2)
#<swps> 15 flops ( 68% of peak) (madds count as 1)
#<swps>  0 madds (  0% of peak)
#<swps> 21 mem refs ( 95% of peak)
#<swps>  2 integer ops (  9% of peak)
#<swps> 38 instructions ( 86% of peak)
#<swps>  3 short trip threshold
#<swps> 19 ireg registers used.
#<swps> 21 fgr registers used.
#<swps>
#<swps> 10 possible stall cycles
#<swps> 10 min possible stall cycles
#<swps>

#<swps>
#<swps> Pipelined loop line 126 steady state
#<swps>
#<swps> Not unrolled before pipelining
#<swps> 48 cycles per iteration
#<swps> 42 flops ( 21% of peak) (madds count as 2)
#<swps> 42 flops ( 43% of peak) (madds count as 1)
#<swps>  0 madds (  0% of peak)
#<swps> 51 mem refs ( 53% of peak)
#<swps>  6 integer ops (  6% of peak)
#<swps> 99 instructions ( 51% of peak)
#<swps>  1 short trip threshold
#<swps> 16 ireg registers used.
#<swps> 32 fgr registers used.
#<swps>
#<swps>  3 spills 5 restores introduced.
#<swps> 25 possible stall cycles
#<swps> 11 min possible stall cycles
#<swps>
#<swps> 26 min cycles required for resources
#<swps> 48 cycle schedule register allocated.
#<swps> 30 min cycles required for resources with additional memory refs.
#<swps> 30 min cycles required for recurrences with additional memory refs.
#<swps>